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Abstract. Inspired by the legacy of the Netflix contest, we provide an 
overview of what has been learned — from our own efforts, and those 
of others — concerning the problems of collaborative filtering and rec- 
ommender systems. The data set consists of about 100 million movie 
ratings (from 1 to 5 stars) involving some 480 thousand users and some 
18 thousand movies; the associated ratings matrix is about 99% sparse. 
The goal is to predict ratings that users will give to movies; systems 
which can do this accurately have significant commercial applications, 
particularly on the world wide web. We discuss, in some detail, ap- 
proaches to "baseline" modeling, singular value decomposition (SVD), 
as well as kNN (nearest neighbor) and neural network models; temporal 
effects, cross-validation issues, ensemble methods and other considera- 
tions are discussed as well. We compare existing models in a search for 
new models, and also discuss the mission-critical issues of penalization 
and parameter shrinkage which arise when the dimensions of a pa- 
rameter space reaches into the millions. Although much work on such 
problems has been carried out by the computer science and machine 
learning communities, our goal here is to address a statistical audience, 
and to provide a primarily statistical treatment of the lessons that have 
been learned from this remarkable set of data. 
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1. INTRODUCTION AND SUMMARY 

In what turned out to be an invaluable contribu- 
tion to the research community, Netflix Inc. of Los 
Gatos, California, on October 2, 2006, publicly re- 
leased a remarkable set of data, and offered a Grand 
Prize of one million US dollars to the person or team 
who could succeed in modeling this data to within 
a certain precisely defined predictive specification. 
While this contest attracted attention from many 
quarters — and most notably from within the com- 
puter science and artificial intelligence communities — 
the heart of this contest was a problem of statisti- 
cal modeling, in a context known as collaborative 
filtering. Our goal in this paper is to provide a dis- 
cussion and overview — from a primarily statistical 
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viewpoint — of some of the lessons for statistics which 
emerged from this contest and its data set. This 
vantage will also allow us to search for alternative 
approaches for analyzing such data (while noting 
some open problems), as well as to attempt to un- 
derstand the commonalities and interplay among 
the various methods that key contestants have pro- 
posed. 

Netflix, the world's largest internet-based movie 
rental company, maintains a data base of ratings 
their users have assigned (from 1 "star" to 5 "stars" ) 
to movies they have seen. The intended use of this 
data is toward producing a system for recommend- 
ing movies to users based on predicting how much 
someone is going to like or dislike any particular mov- 
ie. Such predictions can be carried out using infor- 
mation on how much a user liked or disliked other 
movies they have rated, together with information 
on how much other users liked or disliked those same, 
as well as other, movies. Such recommender sys- 
tems, when sufficiently accurate, have considerable 
commercial value, particularly in the context of the 
world wide web. 

The precise specifications of the Netflix data are 
a bit involved, and we postpone our description of 
it to Section 2. Briefly, however, the training data 
consists of some 100 million ratings made by approx- 
imately 480,000 users, and involving some 18,000 
movies. (The corresponding "matrix" of user-by- 
movie ratings is thus almost 99% sparse.) A subset 
of about 1.5 million ratings of the training set, called 
the probe subset, was identified. A further data set, 
called the qualifying data was also supplied; it was 
divided into two approximately equal halves, called 
the quiz and test subsets, each consisting of about 
1.5 million cases, but with the ratings withheld. The 
probe, quiz and test sets were constructed to have 
similar statistical properties. 1 

The Netflix contest was based on a root mean 
squared error (RMSE) criterion applied to the three 
million predictions required for the qualifying data. 
If one naively uses the overall average rating for each 
movie on the training data (with the probe subset 
removed) to make the predictions, then the RMSE 
attained is either 1.0104, 1.0528 or 1.0540, respec- 
tively, depending on whether it is evaluated in sam- 
ple (i.e., on the training set), on the probe set or 
on the quiz set. Netflix's own recommender system, 



1 Readers unfamiliar with the Netflix contest may find it 
helpful to consult the more detailed description of the data 
given in Section 2. 



called Cinematch, which is known to be based on 
computer-intensive but "straightforward linear sta- 
tistical models with a lot of data conditioning" is 
known to attain (after fitting on the training data) 
an RMSE of either 0.9514 or 0.9525, on the quiz 
and test sets, respectively. (See Bennett and Lan- 
ning, 2007.) These values represent, approximately, 
a 9^% improvement over the naive movie-average 
predictor. The contest's Grand Prize of one million 
US dollars was offered to anyone who could first 2 im- 
prove the predictions so as to attain an RMSE value 
of not more than 90% of 0.9525, namely, 0.8572, or 
better, on the test set. 

The Netflix contest began on Oct 2, 2006, and was 
to run until at least Oct 2, 2011, or until the Grand 
Prize was awarded. More than 50,000 contestants 
internationally participated in this contest. Yearly 
Progress Prizes of $50,000 US were offered for the 
best improvement of at least 1% over the previous 
year's result. The Progress Prizes for 2007 and 2008 
were won, respectively, by teams named "BellKor" 
and "BellKor in BigChaos." Finally, on July 26, 
2009, the Grand Prize winning entry was submitted 
by the "BellKor's Pragmatic Chaos" team, attain- 
ing RMSE values of 0.8554 and 0.8567 on the quiz 
and test sets, respectively, with the latter value rep- 
resenting a 10.06% improvement over the contest's 
baseline. Twenty minutes after that submission (and 
in accordance with the Last Call rules of the con- 
test) a competing submission was made by "The 
Ensemble" — an amalgamation of many teams — who 
attained an RMSE value of 0.8553 on the quiz set, 
and an RMSE value of 0.8567 on the test set. To 
two additional decimal places, the RMSE values at- 
tained on the test set were 0.856704 by the winners, 
and 0.856714 by the runners up. Since the contest 
rules were based on test set RMSE, and also were 
limited to four decimal places, these two submissions 
were in fact a tie. It is therefore the order in which 
these submissions were received that determined the 
winner; following the rules, the prize went to the ear- 
lier submission. Fearing legal consequences, a sec- 
ond and related contest which Netflix had planned 
to hold was canceled when it was pointed out by 
a probabilistic argument (see Narayanan and Shma- 
tikov, 2008) that, in spite of the precautions taken 
to preserve anonymity, it might theoretically be pos- 
sible to identify some users on the basis of the seem- 
ingly limited information in the data. 



2 Strictly, our use of "first" here is slightly inaccurate owing 
to a Last Call rule of the competition. 
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Of course, nonrepeatability, and other vagaries of 
ratings by humans, itself imposes some lower bound 
on the accuracy that can be expected from any rec- 
ommender system, regardless of how ingenious it 
may be. It now appears that the 10% improvement 
Netflix required to win the contest is close to the 
best that can be attained for this data. It seems 
fair to say that Netflix technical staff possessed "for- 
tuitous insight" in setting the contest bar precisely 
where it did (i.e., 0.8572 RMSE); they also were well 
aware that this goal, even if attainable, would not 
be easy to achieve. 

The Netflix contest has come and gone; in this 
story, significant contributions were made by Yehuda 
Koren and by "BellKor" (R. Bell, Y. Koren, C. Volin- 
sky), "BigChaos" (M. Jahrer, A. Toscher), larger 
teams called "The Ensemble" and "Grand Prize," 
"Gravity" (G. Takacs, I. Pilaszy, B. Nemeth, 
D. Tikk), "ML@UToronto" (G. Hinton, A. Mnih, 
R. Salakhutdinov; "ML" stands for "machine learn- 
ing"), lone contestant Arkadiusz Paterek, "Pragmat- 
ic Theory" (M. Chabbert, M. Piotte) and many oth- 
ers. Noteworthy of the contest was the craftiness 
of some participants, and the open collaboration of 
others. Among such stories, one that stands out is 
that of Brandyn Webb, a "cybernetic epistemolo- 
gist" having the alias Simon Funk (see Piatetsky, 
2007). He was the first to publicly reveal use of the 
SVD model together with a simple algorithm for its 
implementation that allowed him to attain a good 
early score in the contest (0.8914 on the quiz set). 
He also maintains an engaging website at http : / / 
sifter. org/ ~ s imon/ j ournal. 

Although inspired by it, our emphasis in this pa- 
per is not on the contest itself, but on the funda- 
mentally different individual techniques which con- 
tribute to effective collaborative filtering systems 
and, in particular, on the statistical ideas which un- 
derpin them. Thus, in Section 2, we first provide 
a careful description of the Netflix data, as well as 
a number of graphical displays. In Section 3 we es- 
tablish the notation we will use consistently through- 
out, and also include a table summarizing the perfor- 
mance of many of the methods discussed. Sections 4, 
5, 6 and 7 then focus on four key "stand-alone" tech- 
niques applicable to the Netflix data. Specifically, 
in Section 4 we discuss ANOVA techniques which 
provide a baseline for most other methods. In Sec- 
tion 5 we discuss the singular value decomposition 
or SVD (also known as the latent factor model, or 
matrix factorization) which is arguably the most ef- 
fective single procedure for collaborative filtering. 



A fundamentally different paradigm is based on neu- 
ral networks — in particular, the restricted Boltzman 
machines (RBM) — which we describe in Section 6. 
Last of these stand-alone methods are the nearest 
neighbor or kNN methods which are the subject of 
Section 7. 

Most of the methods that have been devised for 
collaborative filtering involve parameterizations of 
very high dimension. Furthermore, many of the mod- 
els are based on subtle and substantive contextual 
insight. This leads us, in Section 8, to undertake 
a discussion of the issues involved in dimension re- 
duction, specifically penalization and parameter 
shrinkage. In Section 9 we digress briefly to describe 
certain temporal issues that arise, but we return 
to our main discussion in Section 10 where, after 
exploring their comparative properties, and taking 
stock of the lessons learned from the ANOVA, SVD, 
RBM and kNN models, we speculate on the requi- 
site characteristics of effective models as we search 
for new model classes. 

In response to the Netflix challenge, subtle, new 
and imaginative models were proposed by many con- 
test participants. A selection of those ideas is sum- 
marized in Section 11. At the end, however, winning 
the actual contest proved not to be possible with- 
out the use of many hybrid models, and without 
combining the results from many prediction meth- 
ods. This ensemble aspect of combining many proce- 
dures is discussed briefly in Section 11. Significant 
computational issues are involved in a data set of 
this magnitude; some numerical issues are described 
briefly in Section 13. Finally, in Section 14, we sum- 
marize some of the statistical lessons learned, and 
briefly note a few open problems. In large part be- 
cause of the Netflix contest, the research literature 
on such problems is now sufficiently extensive that 
a complete listing is not feasible; however, we do 
include a broadly representative bibliography. For 
earlier background and reviews, see, for example, 
ACM SIGKDD (2007), Adomavicius and Tuzhilin 
(2005), Bell et al. (2009), Hill et al. (1995), Hoffman 
(2001b), Marlin (2004), Netflix (2006/2010), Park 
and Pennock (2007), Pu et al. (2008), Resnick and 
Varian (1997) and Tuzhilin at al. (2008). 

2. THE NETFLIX DATA 

In this section we provide a more detailed overview 
of the Netflix data; these in fact consist of two key 
components, namely, a training set and a qualify- 
ing set. The qualifying data set itself consists of 
two halves, called the quiz set and the test set; fur- 
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thermore, a particular subset of the training set, 
called the probe set, was identified. The quiz, test 
and probe subsets were produced by a random three 
way split of a certain collection of data, and so were 
intended to have identical statistical properties. 

The main component of the Netflix data — namely, 
the "training" set — can be thought of as a matrix of 
ratings consisting of 480,189 rows, corresponding to 
randomly selected anonymous users from Netflix's 
customer base, and 17,770 columns, corresponding 
to movie titles. This matrix is 98.8% sparse; out of 
a possible 480,189 x 17,770 = 8,532,958,530 entries, 
only 100,480,507 ratings are actually available. Each 
such rating is an integer value (a number of "stars" ) 
between 1 (worst) and 5 (best). The data were col- 
lected between October 1998 and December 2005, 
and reflect the distribution of all ratings received 
by Netflix during that period. It is known that Net- 
flix's own database consisted of over 1.9 billion rat- 
ings, on over 85,000 movies, from over 11.7 million 
subscribers; see Bennett and Lanning (2007). 

In addition to the training set, a qualifying data 
set consisting of 2,817,131 user-movie pairs was also 
provided, but with the ratings withheld. It consists of 
two halves: a quiz set, consisting of 1,408,342 user- 
movie pairs, and a test set, consisting of 1,408,789 
pairs; these subsets were not identified. Contestants 
were required to submit predicted ratings for the en- 
tire qualifying set. To provide feedback to all partici- 
pants, each time a contestant submitted a set of pre- 
dictions Netflix made public the RMSE value they 
attained on a web-based Leaderboard, but only for 
the quiz subset. Prizes, however, were to be awarded 
on the basis of RMSE values attained on the test sub- 
set. The purpose of this was to prevent contestants 
from tuning their algorithms on the "answer oracle." 

Netflix also provided the dates on which each of 
the ratings in the data sets were made. The rea- 
son for this is that Netflix is more interested in 
predicting future ratings than in explaining those 
of the past. Consequently, the qualifying data set 
had been selected from among the most recent rat- 
ings that were made. However, to allow contestants 
to understand the sampling characteristics of the 
qualifying data set, Netflix identified the probe sub- 
set of 1,408,395 user-movie pairs within the training 
set (and hence with known ratings) , whose distribu- 
tional properties were meant to match those of the 
qualifying data set. (The quiz, test and probe sub- 
sets were produced from the random three-way split 
already mentioned.) As a final point, prior to re- 
leasing their data, Netflix applied some statistically 



neutral perturbations (such as deletions, changes of 
dates and/or ratings) to try to protect the confiden- 
tiality and proprietary nature of its client base. 

In our discussions, the term "training set" will 
generally refer to the training data, but with the 
probe subset removed; this terminology is in line 
with common usage when a subset is held out dur- 
ing a statistical fitting process. Of course, for pro- 
ducing predictions to submit to Netflix, contestants 
would normally retrain their algorithms on the full 
training set (i.e., with probe subset included). As the 
subject of our paper is more concerned with collabo- 
rative filtering generally, rather than with the actual 
contest, we will make only limited reference to the 
qualifying data set, and mainly in our discussion on 
implicit data in Section 11, or when indicating cer- 
tain scores that contestants achieved. 

Finally, we mention that Netflix also provided the 
titles, as well as the release years, for all of the 
movies. In producing predictions for its internal use, 
Netflix's Cinematch algorithm does make use of other 
data sources, such as (presumably) geographical, or 
other information about its customers, and this al- 
lows it to achieve substantial improvements in RMSE. 
However, it is known that Cinematch does not use 
names of the movies, or dates of ratings. In any 
case, to produce the RMSE values on which the con- 
test was to be based, Cinematch was trained with- 
out any such other data. Nevertheless, no restric- 
tions were placed on contestants from using external 
sources, as, for instance, other databases pertaining 
to movies. Interestingly however, none of the top 
contestants made use of any such auxiliary informa- 
tion. 3 

Figures 1 through 6 provide some visualizations of 
the data. Figure 1 gives histograms for the ratings 
in the training set, and in the probe set. Reflecting 
temporal effects to be discussed in Section 9 (but see 
also Figure 5), the overall mean rating, 3.6736, of 
the probe set is significantly higher than the overall 
mean, 3.6033, of the training set. Figures 2 and 3 
are plots (in lieu of histograms) of the cumulative 
number of ratings in the training, probe and qual- 
ifying sets. Figure 2 is cumulative by movies (hori- 
zontal axis, and sorted from most to least rated in 
the training set), while Figure 3 is cumulative by 



' This is not to say they did not try. But perhaps surpris- 
ingly — with the possible exception of a more specific release 
date — such auxiliary data did not improve RMSE. One pos- 
sible explanation for this is that the Netflix data set is large 
enough to proxy such auxiliary information internally. 
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Histograms of Ratings in the Training and Probe Data 
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FlG. 1. Frequency histograms for ratings in the training set 
(white) and probe set (black). 



Cumulative Ratings by Movie 
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Fig. 2. Cumulative proportion of ratings, by movies, for the 
training, probe and qualifying sets. Movies are on the hori- 
zontal axis, sorted from most to least rated. 



Cumulative Ratings by User 
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Fig. 3. Cumulative proportion of ratings, by users, for the 
training, probe and qualifying sets. Users are on the horizontal 
axis, sorted by number of movies rated (from most to least). 

Histograms of Mean Ratings 
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Fig. 4. Histograms for mean movie ratings (bars) and mean 
user ratings (line with points) in the training set. 



users. The steep rise in Figure 2 indicates, for in- 
stance, that the 100 and 1000 most rated movies 
account for over 14.3% and 62.5% of the ratings, 
respectively. In fact, the most rated 4 movie (Miss 
Congeniality) was rated by almost half the users in 
the training set, while the least rated was rated only 
3 times. Figure 2 also evidences a slight — although 
statistically significant — difference in the profiles for 
the training and the qualifying (and probe) data. 



4 We remark here that rented movies can be rated without 
having been watched. 



In Figure 3, the considerable mismatch between the 
curve for the training data, with the curves for the 
probe and qualifying sets which match closely, re- 
flects the fact that the representation of user view- 
ership in the training set is markedly different from 
that of the cases for which predictions are required; 
clearly, Netflix constructed the qualifying data to 
have a much more uniform distribution of user view- 
er ship. 

Figure 4 provides histograms for the movie mean 
ratings and the user mean ratings. The reason for 
the evident mismatch between these two histograms 
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Ratings by Time 
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Time 

Fig. 5. Temporal effects: Histograms for the dates (quar- 
terly) of ratings in the training set (white bars) and qualifying 
set (inlaid black bars). The line with points shows quarterly 
mean ratings for the training data (scale on right). 

is that the best rated movies were watched by dis- 
proportionately large numbers of users. Finally, Fig- 
ure 5 exemplifies some noteworthy temporal effects 
in the data. Histograms are shown for the number 
of ratings, quarterly, in the training and in the qual- 
ifying data sets, with the dates in the qualifying set 
being much later than the dates in the training set. 
Figure 5 also shows a graph of the quarterly mean 
ratings for the training data (with the scale being 
at the right). Clearly, a significant rise in mean rat- 
ing occurred starting around the beginning of 2004. 
Whether this occurred due to changes in the ratings 
definitions, to the introduction of a recommender 
system, to a change in customer profile, or due to 
some other reason, is not known. 

Finally, Figure 6 plots the mean movie ratings 
against the (log) number of ratings. (Only a ran- 
dom sample of movies is used so as not to clutter 
the plot.) The more rated movies do tend to have 
the higher mean ratings but with some notable ex- 
ceptions, particularly among the less rated movies 
which can sometimes have very high mean ratings. 

As a general comment, the layout of the Netflix 
data contains enormous variation. While the average 
number of ratings per user is 209, and the average 
number of ratings per movie is 5654.5 (over the en- 
tire training set), five users rated over 10,000 movies 
each, while many rated fewer than 5 movies. Like- 
wise, while some movies were rated tens of thou- 
sands of times, most were rated fewer than 1000 
times, and many less than 200 times. The extent of 
this variation implies large differences in the accu- 



Mean Movie Ratings 



4.5 - 




3 4 5 6 7 8 9 10 11 12 

Log Number of Ratings 

Fig. 6. Mean movie ratings versus (logarithm of) number of 
ratings (for a random subset of movies). 

racies with which user and movie parameters can 
be estimated, a problem which is particularly severe 
for the users. Such extreme variation is among the 
features of typical collaborative filtering data which 
complicate their analyses. 

3. NOTATION AND A SUMMARY TABLE 

In this section we establish the notation we will 
adhere to in our discussions throughout this paper. 
We also include a table, which will be referred to in 
the sequel, of RMSE performance for many of the 
fitting methods we will discuss. 

Turning to notation, we will let i = 1, 2, . . . , I range 
over the users (or their indices) and j = 1, 2, . . . , J 
range over the movies (or their indices). For the Net- 
flix training data, I = 480,189 and J = 17,770. Next, 
we will let J(i) be the set of movies rated by user i 
and be the set of users who rated movie j. The 
cardinalities of these sets will be denoted variously 
as Ji = \ J(i)\ and Ij = We shall also use the 

notation C for the set of all user-movie pairs 
whose ratings are given. Denoting the total number 
of user-movie ratings in the training set by N, note 
that N = \C\ = Yli=i = S/=i Ij- The ratings are 
made on an ordered scale (such scales are known 
as "Likert scales") and are coded as integers hav- 
ing values k = 1, 2, . . . ,K; for Netflix, K = 5. The 
actual ratings themselves, for £ C, will be de- 
noted by ri a. Averages of rj , over i E over 
j £ J(i), or over C (i.e., over movies, or users, or 
over the entire training data set) will be denoted 
by r.j, r^. and r v respectively. Estimated values 



NETFLIX CHALLENGE 



7 



Table 1 

RMSE values attained by various methods 



Predictive model 



RMSE Remarks and references 



H 1.1296 

on 1.0688 

/3 j 1.0528 

= fj, + on + Pj , naive 0.9945 

,,j = n + on + P 3 0.9841 

"Global effects" 0.9657 

Cinematch, on quiz set 0.9514 

Cinematch, on test set 0.9525 

kNN 0.9174 

"Global" + SVD 0.9167 

SVD 0.9167 
"Global" + SVD + "joint kNN" 0.9071 
"Global" + SVD + "joint kNN" 0.8982 

Simon Funk 0.8914 
TemporalDynamics + SVD++ 0.8799 
Arkadiusz Paterek's best score 0.8789 

ML Team: RBM + SVD 0.8787 

Gravity's best score 0.8743 

Progress Prize, 2007, quiz 0.8712 

Progress Prize, 2007, test 0.8723 

Progress Prize, 2008, quiz 0.8616 

Progress Prize, 2008, test 0.8627 

Grand Prize, target 0.8572 

Grand Prize, runner up 0.8553 

Grand Prize, runner up 0.8567 

Grand Prize, winner 0.8554 

Grand Prize, winner 0.8567 



RMSE on probe set, using mean of training set 

Predict by user's training mean, on probe set 

Predict by movie's training mean, on probe set 

Two-way ANOVA, no iteraction 

Two-way ANOVA, no iteraction 

Bell and Koren (2007a, 2007b, 2007c) 

As reported by Netflix 

Target is to beat this by 10% 

Bell and Koren (2007a, 2007b, 2007c) 

Bell and Koren (2007a, 2007b, 2007c) 

Bell and Koren (2007a, 2007b, 2007c), on probe set 

Bell and Koren (2007a, 2007b, 2007c), on probe set 

Bell and Koren (2007a, 2007b, 2007c), on quiz set 

An early submission; Leaderboard 

Koren (2009) 

An ensemble of many methods; Leaderboard 

See Section 6; Leaderboard 

November 2007; Leaderboard 

Bell, Koren and Volinsky (2007a, 2007b, 2007c) 

As above, but on the test set 

Bell, Koren and Volinsky (2008), Toscher and Jahrer (2008) 

As above, but on the test set 

10 % below Cinematch's RMSE on test set 

The Ensemble, 20 minutes too late; on quiz set 

As above, but on the test set 

BellKor + BigChaos + PragmaticTheory, on quiz set 
As above, but on the test set 



Selected RMSE values, compiled from various sources. Except as noted, RMSE values shown are either for 
the probe set after fitting on the training data with the probe set held out, or for the quiz set (typically 
from the Netflix Leaderboard) after fitting on the training data with the probe set included. 



are denoted by "hats" as in fij, which may refer to 
the fitted value from a model when SC, or to 
a predicted value otherwise. Many of the procedures 
we discuss are typically fitted to the residuals from 
a baseline fit such as an ANOVA; where this causes 
no confusion, we continue using the notations r^j 
and fij in referring to such residuals. Some proce- 
dures, however, involve both the original ratings as 
well as their residuals from other fits; in such cases, 
the residuals are denoted as j and e^j . Finally, the 
notation I(j,f) will refer to the set of all users who 
saw both movies j and j' , and J(i, i') will refer to the 
set of movies that were seen by both users i and i! . 

Finally, we also include, in this section, a table 
which provides a summary, compiled from multiple 
sources, of the RMSE values attained by many of the 
methods discussed in this paper. The RMSE values 
shown in Table 1 are typically for the probe set, after 
fitting on the remainder of the training set; or where 
known, on the quiz set, after fitting on the entire 



training set; but exceptions to this are noted. Ref- 
erences to the "Leaderboard" refer to performance 
on the quiz set publicly released by Neflix. We refer 
to Table 1 in our subsequent discussions. 

4. ANOVA BASELINES 

ANOVA methods furnish baselines for many anal- 
yses. One basic approach — referred to as preprocess- 
ing — involves first removing global effects such as 
user and movie means, and using the residuals as in- 
put to subsequent models. Alternatively, such "row" 
and "column" effects can be incorporated directly 
into those models where they are sometimes referred 
to as biases. In any case, most models work best 
when global effects are explicitly accounted for. In 
this section we discuss minimizing the sum of squared 
errors criterion 

(4.1) £!>,;- r^-) 2 
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using various ANOVA methods for the predictions 
fij of the user-movie ratings rj,-. Due to the large 
number of parameters, regularization (i.e., penaliza- 
tion) would normally be used, but we reserve our 
discussions of regularization issues to Section 8. 

We first note that the best fitting model of the 
form 

(4.2) nj = n for alH, j, 

obtained by setting fi = 3.6033, the mean of all 
user-movie ratings in the training set (with probe 
removed), results in an RMSE on the training set 
equal to its standard deviation 1.0846; on the probe 
set, using this same (i results in an RMSE of 1.1296, 
although the actual mean and standard deviation 
for the probe set 5 are 3.6736 and 1.1274. 

Next, if we predict each rating by the mean rating 
for that user on the training set, thus fitting the 
model 

(4.3) fij = n + ati, 

we obtain an RMSE of 0.9923 on the training set, 
and 1.0688 using the same values on the probe. If, 
instead, we predict each rating by the mean for that 
movie, thus fitting 

(4.4) rij = l* + Pj, 

we obtain RMSE values 1.0104 and 1.0528 on the 
training and probe sets, respectively. The solutions 
for (4.2)-(4.4) are just the least squares fits associ- 
ated with 

(<j)ec 

respectively, where C is the set of indices over 
the training set. Histograms of the user and movie 
means were given in Figure 4; we note, for later use, 
that the the variances of the user and movie means 
on the test set (with probe removed) are 0.23074 
and 0.27630, corresponding to standard deviations 
of 0.48035 and 0.52564, respectively. 6 



5 Note that the difference between the squares of the probe's 
1.1296 and 1.1274 RMSE values must equal the squared dif- 
ference between the two means, 3.6736 and 3.6033. 

6 These values are useful for assessing regularization issues; 
see Section 8. 



We now consider two-factor models of the form 

(4.6) f it j = fi + ai + fij. 

Identifiability conditions, such as J2i Q i = and 
Y^j fij = 0, would normally be imposed, although 
they become unnecessary under typical regulariza- 
tion. If we were to proceed as in a balanced two-way 
layout (i.e., with no ratings missing), then we would 
first estimate \i as the mean of all available ratings; 
the values of a« and fij would then be estimated as 
the row and column means, over the available rat- 
ings, after /z has been subtracted throughout. Doing 
this results in RMSE values of 0.9244 and 0.9945 
on the training and probe sets. If we proceed se- 
quentially, the order of the operations for estimat- 
ing the aj's and the /3-, 's will matter: If we estimate 
the ccj's first and subtract their effects before esti- 
mating the Pj's, the result will not be the same as 
first estimating the fij's and subtracting their effect 
before estimating the cVs; these procedures result, 
respectively, in RMSE values of 0.9218 and 0.9177 
on the training set. 

The layout for the Netflix data is unbalanced, with 
the vast majority of user-movie pairings not rated; 
we therefore seek to minimize 

(4.7) Yl (ru-ii-at-ft) 2 

over the training set. This quadratic criterion is con- 
vex, however, standard methods for solving the "nor- 
mal" equations, obtained by setting derivatives with 
respect to fj,, en and fij to zero, involve matrix in- 
versions which are not feasible over such high di- 
mensions. The optimization of (4.7) may, however, 
be carried out using either an EM or a gradient de- 
scent algorithm. When no penalization is imposed, 
minimizing (4.7) results in an RMSE value of 0.9161 
on the training set, and 0.9841 on the probe subset. 

A consideration when fitting (4.6) as well as other 
models is that some predicted ratings fij can fall 
outside the [1,5] range. This can occur when highly 
rated movies are rated by users prone to giving high 
ratings, or when poorly rated movies are rated by 
users prone to giving low ratings. Under optimiza- 
tion of (4.7) over the test set, approximately 5.1 mil- 
lion fij estimates fall below 1, and 19.4 million fall 
above 5. Although we may Winsorize (clip) these fij 
to lie in [1,5], clipping in advance need not be op- 
timal when residuals from a baseline fit are input 
to other procedures. We do not consider here the 
problem of minimizing (4.7) when fi + on + fij there 
is replaced by a Winsorized version. 
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Of course, not all is well here. The differences in 
RMSE values between the training and the probe 
sets reflect temporal effects, some of which were al- 
ready noted. Furthermore, these models have pa- 
rameterizations of high-dimensions and have there- 
fore been overfit, resulting in inferior predictions. 
These issues will be dealt with in Sections 8 and 9. 

Finally, we remark that interaction terms can be 
added to (4.6). The standard approach fjj = fj, + 
«i + Pj + 7i,j will not be effective, although it could 
possibly be combined with regularization. Alterna- 
tively, interactions could be based on user x movie 
groupings via "many to one" functions a(i) and b(j), 
and models such as 

(4.8) fij =n + ai + /3j + 7o(i),6(j) • 

There are many possibilities for defining such groups; 
for example, the covariates discussed in Section 11 
or nearest neighbor methods (kNN) can be used 
to construct suitable a(i) and b(j). Some further 
interaction-type ANOVA models are considered in 
Section 10. 

5. SVD METHODS 

In statistics, the singular value decomposition 
(SVD) is best known for its connection to principal 
components: If X = (Xx,Xz, . . . ,X n )' is a random 
vector of means 0, and n x n covariance matrix S, 
then one may represent S as a linear combination 
of mutually orthogonal rank 1 matrices, as in 

/? 

where Ai > A2 > • • • > A n > are ordered eigenvalues 
of S, and Pj corresponding orthonormal (column) 
eigenvectors. The principal components are the ran- 
dom variables PjX . Less commonly known is that 

the nx n matrix = Ylj=i ^jPjPj gives the best 
rank k reconstruction of S, in the sense of minimiz- 
ing the Frobenius norm ||E — 7"^'||, defined as the 
square root of the sum of the squares of its entries. 

These results generalize. If A is an arbitrary real- 
valued m x n matrix, its singular value decomposi- 
tion is given by 

A = UDV', 

where U = {U\,U2, ■ • • , U m ) is an m x m matrix whose 
columns Uj are orthonormal eigenvectors of AA' , 
where V = (Vi, V2, • • • , V n ) is an n x n matrix whose 



7 If A is complex-valued, these relations still hold, with con- 
jugate transposes replacing transposes. 



columns Vj are orthonormal eigenvectors of A' A, 
and where D is an m x n "diagonal" matrix whose 
diagonal entries may be taken as the descending or- 
der nonnegative values 

Xj = +J {eigv&\ AA'}j 



= + y {eigv&\A'A}j, j = 1, 2, . . . , mfn(m, n), 

called the singular values of A. The columns of V 
and U provide natural bases for inputs to and out- 
puts from the linear transformation A. In partic- 
ular, AVj = XjUj, A'Uj = XjVj, so given an input 
B = Ej=i Cj V^- , the corresponding output is AB = 

^min(m,n) x TJ 

2^j=l A j c j u j- 

Given an SVD of A, the Eckart- Young Theorem 
states that, for a given k < min(m, n), the best rank k 
reconstruction of A, in the sense of minimizing the 
Frobenius norm of the difference, is U^D^(V^)', 
where is the m x k matrix formed from the 

first k columns of U, 7^ is the n x k matrix formed 
by the first k columns of V, and D^> is the upper 
left k x k block of D. This reconstruction may be 
expressed in the form FG' where F is m x k and G 
is k x n; the reconstruction is thus formed from the 
inner products between the A:-vectors comprising F 
with those comprising G. These fe-vectors may be 
thought of as associated, respectively, with the rows 
and the columns of A, and (in applications) the com- 
ponents of these vectors are often referred to as fea- 
tures. A numerical consequence of the Eckart- Young 
Theorem is that "best" rank k approximations can 
be determined iteratively: given a best rank k — 1 ap- 
proximation, FG' , say, a best rank k approximation 
is obtained by attaching a column vector to each 
of F and G which provide a best fit to the residual 
matrix A — FG' . SVD algorithms can therefore be 
quite straightforward. Here, however, we are specif- 
ically concerned with algorithms applicable to ma- 
trices which are sparse. We briefly discuss two such 
algorithms, useful in collaborative filtering, namely, 
alternating least squares (ALS) and gradient de- 
scent. Some relevant references are Bell and Ko- 
ren (2007c), Bell, Koren and Volinsky (2007a), Funk 
(2006/2007), Koren, Bell and Volinsky (2009), Raiko, 
Ilin and Karhunen (2007), Srebro and Jaakkola (2003), 
Srebro, Rennie and Jaakkola (2005), Takacs et al. 
(2007, 2008a, 2008b, 2008c), Wu (2007) and Zhou 
et al. (2008). See also Hofmann (2001a, 2004), Hof- 
mann and Puzicha (1999), Kim and Yum (2005), 
Marlin and Zemel (2004), Rennie and Srebro (2005), 
Sali (2008) and Zou et al. (2006). 
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The alternating least squares (ALS) method for 
determining the best rank p reconstruction involves 
expressing the summation in the objective function 
in two ways: 



k=l 



J 



(5.1) 



h3 



y~] u i,kVj,k 



k=l 
V 



=EEl 

j=i iei(j) 
= YY [ r hi-J2 Ui > kV i> k ) ■ 

i=l j£J(i) V k=l ) 

The Ui t k may be initialized using small independent 
normal variables, say. Then, for each fixed j, we 
carry out the least squares fit for the Vj t k based on 
the inner sum in the middle expression of (5.1). And 
then, for each fixed i, we carry out the least squares 
fit for Ui t k based on the inner sum of the last ex- 
pressions in (5.1). This procedure is iterated until 
convergence; several dozen iterations typically suf- 
fice. 

ALS for SVD with regularization 8 proceeds simi- 
larly. For example, minimizing 9 

YY\ r ^ ~Y Ui ^ v iA 

C \ k=l J 

I J 

+ Ai^||^|| 2 + A 2 ^||^f 

i=l j=l 

leads to iterations which alternate between minimiz- 
ing 

(5-3) ^ ( r M ~ Y U i,k V 3,k J +Al||Uj|| 2 

iei(j) V k=i J 
with respect to the Vj and then minimizing 
/ p \ 2 



(5.2) 



( 5 - 4 ) Y n >i ~ Y Ui ' kV i' k + A 2 | 



U: 



k=l 



Although we prefer to postpone discussion of regulariza- 
tion to the unified treatment attempted in Section 8, it is 
convenient to lay out those SVD equations here. 

9 We prefer not to set Ai = A2 at the outset for reasons of 
conceptual clarity; see Section 8. In fact, because a constant 
may pass freely between user and movie features, generality is 
not lost by taking Ai = A2. Generality is lost, however, when 
these values are held constant across all features; see Section 8. 



with respect to the Uj^; these are just ridge regres- 
sion problems. 10 

ALS can also be performed one feature at a time, 
with the advantage of yielding factors in descending 
order of importance. To do this, we initialize as be- 
fore, and again arrange the order of summation in 
the objective function in two different ways; for the 
first feature, this is 



(5.5) 



Y Y ^ ~ u hi v irf 

3 = 1 



= Y Y ( r i,3 - U i,l V 3,l) 
i=l L jgJ(i) 

We then iterate between the least squares problems 
of the inner sums in (5.5), namely, 

(5.6) = u i,i r i,j/ Y u h 
for all j, and then 

(5.7) Ui,i = Y v 3A r i,j/ Y v h 

jeJ(i) jeJ{i) 

for all i, until convergence. After k — 1 features have 
been fit, we compute the residuals 

k-l 



Y Ui > £V i>' 



e=i 



and replace (5.6) and (5.7) by 

Hh= Y U i,k e i,3 Y 



l i,k 



and 



Uih 



E 



V j,k e i,j 



E 



j£.J(i) 

ranging over all j and all i, respectively. 

Regularization in one-feature-at-a-time ALS can 
be effected in several ways. Bell, Koren and Volinsky 
(2007a) shrink the residuals ejj via 
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J Some contestants preferred the regularization 



E" ;i ''•••<• + A (INI 2 + INI 2 ) 



EE 

c 

instead of (5.2), which changes the Ai and A2 in (5.3) and (5.4) 
into IjX and JiX, respectively. In Section 8 we argue that this 
modification is not theoretically optimal. 
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where rtjj = min(Jj, Jj) measures the "support" 
for rij, and they increase the shrinkage parame- 
ter Afc with each feature k. Alternately, one could 
add a regularization term 

AfcGKII 2 + ||ffcll 2 ) 

when fitting the kth. feature, choosing the A& by 
cross-validation. 

Finally, we consider gradient descent approaches 
for fitting SVD models. For an SVD of dimension p, 
say, we first initialize all un- and Vj ^ in 



the update equations become 



Then write 



k=l 



V 



k=l 



and note that 



and 



de 2 ■ 



de 2 - 
dvj >k 



-2esoV 



i,i v j,k 



2e i a%Xi 



i,j"-i,k- 



Updating can then be done locally following the neg- 
ative gradients: 



old 



J u j,k 

vf d k +2r ] e l ,utl 



and 



(5.8) 

3,k 

where the learning rate n controls for overshoot. For 
a given (i, j) G C, these equations are used to update 
the Uj jt and Vjk for all k; we then cycle over the 
(i,j) GC until convergence. If we regularize 11 the 
problem, as in 



(5.9) 



fc=l 



+A(Eiwi a +Ei 



If, instead of (5.9), we regularized as 

c 

then the gradient descent update equations (5.10) become 

new old . / rt old \ old\ n 

«i,fc = «i,fc + V( 2 ei,jVj,k - Au ijfc ) and 

new old . /r» old \ old \ 

Vj,k = v j>k + T)(2ei,jU itk - Xv jtk ). 



it 



i.k 



(5.10) 



new _„,old , / 9 old 
V j,k ~ v j,k -rV\ Ze M u i,k 



Ji 

A 

7 



old 



„old 



and 



We note that, as in ALS, there are other ways to 
sequence the updating steps in gradient descent. Si- 
mon Funk (2006/2007), for instance, trained the fea- 
tures one at a time. To train the kth feature, one ini- 
tializes the and Vj^ randomly, and then loops 
over all GC, updating the kth. feature for all 
users and all movies. The updating equations are as 
before [e.g., (5.10)] except based on residuals e^j = 

r i,j ~ Yle=i u i,£ v j/- After convergence, one proceeds 
to the next feature. 

We remark that sparse SVD problems are known 
to be nonconvex and to have multiple local min- 
ima; see, for example, Srebro and Jaakkola (2003). 
Nevertheless, starting from different initial condi- 
tions, we found that SVD seldom settled into en- 
tirely unsatisfactory minima, although the minima 
attained did vary slightly. The magnitude of these 
differences was commensurate with the variation in- 
herent among the options available for regulariza- 
tion. We also found that averaging the results from 
several SVD fits started at different initial condi- 
tions could lead to better results than a single SVD 
fit of a higher dimension. On this point, see also Wu 
(2007). Finally, we note the recent surge of work on 
a problem referred to as matrix completion; see, for 
example, Candes and Plan (2009). 

6. NEURAL NETWORKS AND RBMS 

A restricted Boltzman machine (RBM) is a neu- 
ral network consisting of one layer of visible units, 
and one layer of invisible ones; there are no con- 
nections between units within either of these lay- 
ers, but all units of one layer are connected to all 
units of the other layer. To be an RBM, these con- 
nections must be bidirectional and symmetric; some 
definitions require that the units only take on bi- 
nary values, but this restriction is unnecessary. We 
remark that the symmetry condition is only needed 
so as to simplify the training process. See Figure 7; 
additional clarification will emerge from the discus- 
sion below. The name for these networks derives 
from the fact that their governing probability dis- 
tributions are analogous to the Boltzman distribu- 
tions which arise in statistical mechanics. For fur- 
ther background, see, for example, Hertz, Krogh and 
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hidden units 



symmetric weights 



visible units 
(ratings) 



Fig. 7. The RBM model for a single user: Each of the user's 
hidden units is connected to every visible unit ( a multinomial 
observation) that represents a rating made by that user. Every 
user is associated with one RBM, and the RBM models for 
the different users are linked through the common symmetric 
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Palmer (1991), Section 7.1, Izenman (2008), Chap- 
ter 10, or Ripley (1996), Section 8.4. See also Bishop 
(1995, 2006). We will describe the RBM model that 
has been applied to the Netflix data by Salakhut- 
dinov, Mnih and Hinton (2007), whom we will also 
refer to as SMH. 

In the SMH model, to each user i, there corre- 
sponds a length F vector of hidden (i.e., unobserved) 
units, or features, h = (hi, h%, . . . , hp). These fea- 
tures, hf, for /= 1,2, ... ,F, are random variables 
posited to take on binary values, or 1. Note that 
subscripting to indicate the dependence of h on the 
ith user has been suppressed. Next, instead of think- 
ing of the ratings of the ith user as the collection of 
values rij for j G J(i), we think of this user's ratings 
as the collection of vectors Vj = (vj,Vj, . . . ,vf), for 
j £ J(i), that is, for each of the movies he or she has 
seen. Each of these vectors is defined by setting all 
of its elements to 0, except for one: namely, Vj = 1, 
corresponding to n ~ = k. Here K is the number of 
possible ratings; for Netflix, K = 5. The collection 
of these Vj vectors for our ith user [with j € J(i)] 
will be denoted by v. Here again, the dependence 
of v, as well as of the Vj and the Vj, on the user i is 
suppressed. 

We next introduce the symmetric weight param- 
eters Wf f for 1 < j < J, 1 < / < F and 1 < k < K, 
which link each of the F hidden features of a user 
with each of the J possible movies; these weights 
also carry a dependence on the rating values k. 
The Wjf are not dependent on the user; the same 
weights apply to all users, however, only weights for 
the movies he or she has rated will be relevant for 
any particular user. 



We next specify the underlying stochastic model. 
First, the distributions of the (v, h) are assumed to 
be independent across users. We therefore only need 
to specify a probability distribution on the collec- 
tion (v, h) for the ith user. This distribution is de- 
termined by its two conditional distributions mod- 
eled as follows: The conditional distribution of the 
zth user's observed ratings information v, given that 
user's hidden features vector h, is modeled as a (one- 
trial) multinomial distribution 

P(v^ = l\h) 

(6.1) 

exp(tf + E/ = iM*&) 

where the denominator is just a normalizing factor. 
Next, the conditional distributions of the ith user's 
hidden features variables, given that user's observed 
ratings v, are modeled as conditionally independent 
Bernoulli variables 



(6.2) P(h f = l\v) = a[b 



k=l J 



where cr(x) = 1/(1 + e x ) is the sigmoidal function. 
Note that (6.2) is equivalent to the linear logit model 

P(h f = l\v) 



log 



(6.3) 



l-P(h f = l\v) 



rk . 



jeJ(i) k=i 

in effect, (6.2)/(6.3) models user features in terms of 
the movies the user has rated, and the user's ratings 
for them. Note that the weights (interaction parame- 
ters) WK are assumed to act symmetrically in (6.1) 

and (6.2). The parameters bj and bf are referred to 

as biases; the bj may be initialized to the logs of 
their respective sample proportions over all users. 
We remark that in this model there is no analogue 
for user biases. 

To obtain the joint density of v and h from their 
two conditional distributions, we make use of the 
following result: Suppose f(x,y) is a joint density 
for (X, Y), and that fi(x\y), f-2(y\x) are the corre- 
sponding conditional density functions for X\Y and 
Y\X. Then noting the elementary equalities 

f2(y\x* 



f(x,y) = fi(x\y) x 



h(y\x) 



h(x*\y) 

h(x\y*) 
f2(y*\x) 



x fx(x*) 



x fv(y* 
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we see that f(x, y) can be determined from f\ and and 
since it is proportional to either of 

/.(*)* #^ and AWxM 
fi{x*\y) h\y \ x ) 

Here fx and fy are the marginals of X and Y, and 
the choices of x* and y* are arbitrary. It follows that 
the joint density of (v, h) satisfies the proportional- 
ity 

with the choice /i* = 0, this yields 

p(v, h) oc exp{— E(v, h)}, 



Now 



(6.5) 



j>f v' h' 

dlogp(v) _ d\og(Y!, h exp(-E(v,h))) 



dlogZ 
~ dWK ' 

the first term on the right in (6.5) equals 



^2exp{-E{v,h))h f v^ 



where 
E(v,h) 



jGJ(i)/= lfc=1 j6J(i)fc=l 

•E¥/+ E^ff^l 

/=i jeJ(j) \fc=l / 



£ h exp(-£(u,/i)) ^ 
ft 

while the second term on the right in (6.5) is 
1 ^ ^EE-M-E(v',h'))h' f vf 



Zdw h 



The computations here just involve taking products 
over the observed ratings using (6.1), and over the 
hidden features using (6.2). By analogy to formu- 
lae in statistical mechanics, E(v, h) is referred to as Hence, altogether, 
an energy; note that only movies whose ratings are 
known contribute to it. The joint density of (v, h) 
can therefore be expressed as 

/ ,s exp{-E(v,h)} 
P(v,h) - 



EE^»>f- 



AW 



3,f 

e 



h v' h' ' 



Z vlthl e W {-E(v',h')y 

so that the likelihood function (i.e., the marginal 
distribution for the observed data) is 



or, expressed more concisely, 

(6.6) AW}j = e((uf/l/)data " (^V)model)- 

Similarly, we obtain the updating protocols 

(6.7) Ab f = e((/l/)data - (hf)model) 
and 

(6.8) A6^ = e ((^) data -(^) model ). 

Note that the gradients here are for a single user 
only; therefore, the three updating equations (6.6), 
(6.7) and (6.8) must first be averaged over all users. 

The updating equations (6.6), (6.7) and (6.8) for 
implementing maximum likelihood "learning" in- 
volves two forms of averaging. The averaging over 
the "data," that is, based on the p(h\v), is rela- 
where e is a "learning rate." To determine AW*, we tiyely stra ightforward. However, the averaging over 
will need the derivatives 
dE(v,h) 



(6.4) p(v) = ^h e M-E(v,h)} 
We will use the notation 

v> h' 

for the denominator term of (6.4). 
Now the updating protocol for the W^ is given 



by 



AW, 



dlogp(v) 

'-dwfj- 



dW* 

JiJ 



-h f v* 



the "model" is impractical, as it requires Gibbs-type 
MCMC sampling from p(v, h) which involves iterat- 
ing between (6.1) and (6.2). SMH instead suggest 
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running this Gibbs sampler for only a small num- 
ber of steps at each stage, a procedure referred to as 
"contrastive divergence" (Hinton, 2002). For further 
details, we refer the reader to SMH. 

Numerous variations on the model defined by (6.1) 
and (6.2) are possible. In particular, the user fea- 
tures h may be modeled as Gaussian variables hav- 
ing, say, unit variances. In this case the model for 
P(vj = l\h) remains as at (6.1), but (6.2) becomes 

P(h f = h\v) 



1 



exp< 



-(*-»/ 



K 



E E"M/ 



jej(i) k=i 

The marginal distribution p(v) remains as at (6.4) 
except with energy term 

F K K 

E(v,h) = 



j e j(i)/=ife=i 



E E"* 6 ' 



j e j(i) fc=i 



f=l jeJ(i) \k=l / 

The parameter updating equations remain unchang- 
ed. Salakhutdinov, Mnih and Hinton (2007) report 
that this Gaussian version does not perform as well 
as the binary one; perhaps the nonlinear structure 
in (6.2) is useful for modeling the Netflix data. Bell, 
Koren and Volinsky (2007b, 2008), on the other hand, 
prefer the Gaussian model. 

SMH indicate that to contrast sufficiently among 
users, good models typically require the number of 
binary user features to be not less than about F = 
100. Hence, the dimension of the weights W, which 
is J x F x K , can be upward of ten million. The 
parameterization of W can be reduced somewhat 
by representing it product of matrices of lower 
rank, as in Wjj = Y^e=i^je-^if- This approach re- 
duces the number of W parameters to J x p x K + 
p x F, a factor of about p/F. 

There is a further point which we mention only 
briefly here, but return to in Section 11. While the 
Netflix qualifying data omits ratings, it does provide 
implicit information in the form of which movies 
users chose to rate; this is particularly useful for 
users having only a small number of ratings in the 
training set. In fact, the full binary matrix indicat- 
ing which user-movie pairs were rated (regardless of 
whether or not the ratings are known) is an impor- 
tant information source. This information is valu- 
able because the values missing in the ratings ma- 
trix are not "missing at random" and for purposes of 



the contest, exploiting this information was critical. 
It turns out that RBM models can incorporate such 
implicit information in a relatively straightforward 
way; according to Bell, Koren and Volinsky (2007b), 
this is a key strength of RBM models. For further 
details we refer the reader to SMH. 

SMH reported that, when they also incorporated 
this implicit information, RBMs slightly outperform- 
ed carefully-tuned SVD models. They also found 
that the errors made by these two types of models 
were significantly different so that linearly combin- 
ing multiple RBM and SVD models, using coeffi- 
cients determined over the probe set, allowed them 
to achieve an error rate over 6% better than Cine- 
match. The ML@UToronto team's Leaderboard score 
ultimately attained an RMSE of 0.8787 on the quiz 
set (see Table 1). 

7. NEAREST NEIGHBOR (KNN) METHODS 

Early recommender systems were based on nearest 
neighbors (kNN) methods, and have the advantage 
of conceptual and computational simplicity useful 
for producing convincing explanations to users as to 
why particular recommendations are being made to 
them. Usually applied to the residuals from a prelim- 
inary fit, kNN tries to identify (pairwise) similarities 
among users or among movies and use these to make 
predictions. Although generally less accurate than 
SVD, kNN models capture local aspects of the data 
not fitted completely by SVD or other global models 
we have described. Key references include Bell and 
Koren (2007a, 2007b, 2007c), Bell, Koren and Volin- 
sky (2007a, 2008), Koren (2008, 2010), Sarwar et al. 
(2001), Toscher, Jahrer and Legenstein (2008) and 
Wang, de Vries and Reinders (2006). See also Her- 
locker et al. (2000), Tintarev and Masthoff (2007) 
and Ungar and Foster (1998). 

While the kNN paradigm applies symmetrically 
to movies and to users, we focus our discussion on 
movie nearest neighbors, as these are the more accu- 
rately estimable, however, both effects are actually 
important. A basic kNN idea is to estimate the rat- 
ing that user % would assign to movie j by means 
of a weighted average of the ratings he or she has 
assigned to movies most similar to j among movies 
which that user has rated: 

ry1 s # _ Ej>£N(j;i) S 3,j' r i,j' 

Here the sjji are similarity measures which act as 
weights, and N(j;i) is the set of, say, K movies, 
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that i has seen and that are most similar to j. Let- 
ting I(j,f) = I (J) H be the set of users who 
have seen both movies j and f , similarity between 
pairs of movies can be measured using Pearson's cor- 
relation 



X)ie/(j,/)( r '-J - r -.j)( r i,j' - r ;i') 




or by the variant 

^i£l(j,j')( ri '3 ~ r i,-)\ r hj' ~ r v) 




in which centering is at the user instead of the movie 
means, or by cosine similarity 

y^ielijj') r tj\fT,iei(j,j') r i,j> 

The similarity measure is used to determine the near- 
est neighbors, as well as to provide the weights 
in (7.1). In practice, if an ANOVA, SVD and/or 
other fit is carried out first, kNN would be applied 
to the residuals from that fit; under such "center- 
ing" the behavior of the three similarity measures 
above would be very alike. As the common supports 
I(jij') vary greatly, it is usual to regularize the 8jj> 
via a rule such as 

, \KJJ')\ 

j ' 3 \iUJ')\ + 

A more data-responsive kNN procedure could be 
based on 

^hi = w j>f r i,j'> 

j'£N(j;i) 

where the weights Wj j> (which are specific to the ith 
user) are meant to be chosen via least squares fits 

(7.2) argmin^l ryj - ^ 

W j,j' r i',j' j • 

This procedure cannot be implemented effectively 
as shown because enough ryj' ratings are often not 
available, however, Bell and Koren (2007c) suggest 
how one may compensate for the missing ratings 
here in a natural way. 

Many variations of such methods can be proposed 
and can produce slightly better estimates, although 
at an increased computational burden; see Bell, Ko- 
ren and Volinsky (2008) and Koren (2008, 2010). 
For example, user-specific weights, with their rela- 
tively inaccurate local optimizations, could be re- 



placed by global weights having a relatively more 
accurate global optimization, as in the model 

h,j = Kj + ( r M' - Kf) W 3,? 

(7.3) 

j'£N k (j;i) 

Here uijj> is the same for all users, and the neighbor- 
hoods are now N k (j; i) = J(i) D N k (j), where N k (j) 
is the set of k movies most similar to j as deter- 
mined by the similarity measure. The sum involving 
the Ciji is included in order to model implicit in- 
formation inherent in the choice of movies a user 
rated; for purposes of the Netflix contest, this sum 
would include the cases in the qualifying data. As 
a further enhancement, the hi j following the equal- 
ity and the 6$ ,•/ within the sum could be decou- 
pled, with the second of these remaining as the orig- 
inal baseline values, and the first of these set to 
/x + a,i + bj and then trained simultaneously with the 
model. Furthermore, the sums in (7.3) could each 
be normalized, for instance, using coefficients such 
as \N k (j;i)\-y 2 . 

8. DIMENSIONALITY AND PARAMETER 
SHRINKAGE 

The large number (often millions) of parameters 
in the models discussed make them prone to over- 
fitting, affecting the accuracy of the prediction pro- 
cess. Reducing dimensionality through penalization 
therefore becomes mission critical. This leads to con- 
siderations which are relatively recent in statistics, 
such as the effective number of degrees of freedom of 
a regularized model and its use in assessing predic- 
tive accuracy, as well as to the connections between 
that viewpoint and James-Stein shrinkage and em- 
pirical Bayes ideas. In this section we attempt to 
place such issues within the Netflix context. A dif- 
ficulty which arises here stems from the distribu- 
tional mismatch between the training and validation 
data, however, we will sidestep this issue so as to 
focus on key theoretical considerations. Our discus- 
sion draws from Casella (1985), Copas (1983), Efron 
(1975, 1983, 1986, 1996, 2004), Efron et al. (2004), 
Efron and Morris (1971, 1972a, 1972b, 1973a, 1973b, 
1975, 1977), Houwelingen (2001), Morris (1983), Stein 

(1981) , Ye (1998) and Zou et al. (2007). See also 
Barbieri and Berger (2004), Baron (1984), Berger 

(1982) , Breiman and Friedman (1997), Candes and 
Tao (2007), Carlin and Louis (1996), Fan and Li 
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(2006), Friedman (1994), Greenshtein and Ritov 
(2004), Li (1985), Mallows (1973), Maritz and Lwin 
(1989), Moody (1992), Robins (1956, 1964, 1983), 
Sarwar et al. (2000), Stone (1974) and Yuan and 
Lin (2005). 

Prediction Optimism 

To give some context to our discussion, suppose Y 
is an n x 1 random vector with entries Y ; for i = 
1,2, ... ,n, all having finite second moment, and sup- 
pose the mean of Y is modeled by a vector /x(/3) with 
entries //j(/3), where j3 is a p x 1 vector of parameters. 
We will assume n(f3) is twice differentiable, and that 
it uniquely identifies /3. We also assume that there 
is a unique value of (3, namely, (3q, for which Y can 
be modeled as 

(8.1) Y i = f H (p )+e i , 

with the &i then assumed to have zero means, equal 
variances Var(ej) = a 2 , and to be uncorrelated. The 
vector /jl(/3) may or may not be based on a known, 
fixed design matrix X; all that matters about X is 
that it is considered known, and that it fully deter- 
mines the stochastic properties of Y. 

Now let Y* , with entries Y* , be a stochastically 
independent copy of Y also defined on X, that is, 
on the same experiment. We consider expectation E 
to be defined on the joint probability structure of 
(Y, Y*) or, more precisely, of (Y, Y*)|X; sometimes E 
will act on a function of Y alone, and sometimes on 
a function of both Y and Y*. Our starting point is 
the pair of inequalities 




<£]T[Y/- W (/3)] 2 , 
i=i 

which clearly will be strict, except in degenerate 
situations. The infimum in the middle expression 
is assumed to occur at the value /3 = /3o identified 
at (8.1). The infimum inside the expectation on the 
left occurs at the value of f3 denoted as /3; we inter- 
changeably use the notation (3^ n \ f3(Y) and (3^ n \Y) 
for f3 when we wish to stress its dependence on the 
sample size n, on the data Y, or on both. The 
occurring in the rightmost expression in (8.2) refers 
to entries of fi(^ n \Y)), so that the Y* and m0) 
there are independent. The inequalities (8.2) have 
the interpretation 

(8.3) ^(training error) < no 2 < ^(prediction error), 



it being understood that here the predictions /x(/5) 
are for an independent repetition Y* of the same 
random experiment. Efron (1983) refers to the dif- 
ference between prediction error and fitted error, 
that is, between the right- and left-hand sides in 
(8.2)/(8.3), as the optimism. 

It is helpful, for the sake of exposition, to examine 
the inequalities (8.2)/(8.3) for a linear model, where 
fj,(/3) = Xf3, and X is n x p. In that case, the leftmost 
and rightmost expressions in (8.2) are equidistant 
from the middle one, and (8.2)/(8.3) become 

(8.4) {n-p)a 2 <na 2 <(n + p)a 2 . 

Here the leftmost evaluation follows from the stan- 
dard regression ANOVA, and corresponds to the fact 
that unbiased estimation of a 2 requires dividing the 
training error sum of squares by n — p, while the 
rightmost evaluation follows from 

n n 

#Ek* - =eJ2[M) + < - w(/5)] 2 
i=i i=i 

n 

= na 2 +EY,lM)-Vi(P)} 2 , 
i=l 

where the last expectation here evaluates as 
E(X(3 - XP)'(XP - X0) 

= E(P -P)'(X'X)(P -P) 
= pa 2 

since (3o — j3 has mean and covariance a 2 (X' ' X)~ l . 

The inequalities (8.2)/(8.3) hold whether or not 
we have a linear model = Xf3, but the exact 
evaluations of their left- and right-most terms as 
at (8.4) do not. However, these evaluations (as well 
as their equidistances from ncr 2 ) continue to hold 
asymptotically: if the dimension of f3 stays fixed 
at p, and if the design X changes with n in such 
a way that the convergence of the least squares es- 
timate /3( n ) to fio is -^/n-consistent, then both 

(8.5) lim I no 2 - EM V[Y; - ^(/3)] 2 1 =pa 2 
and 

(8.6) lim {EY^[Y*- l x l 0)] 2 -na 2 \=pa 2 . 

I 8=1 J 

The proofs involve Taylor expanding ^(/3^ n ^) around 
/3 = Po (recall fi is twice differentiable) and following 
the proofs for the linear case; terms in the expan- 
sion beyond the linear one are inconsequential by 
the -y/n-consistency. 
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Effective Degrees of Freedom 

The distances pa 2 across both boundaries in (8.4), 
as well as at (8.5) and (8.6), lead to a natural def- 
inition for the effective number of degrees of free- 
dom of a statistical fitting procedure. In the linear 
case, faP) = X/3, using the least squares estima- 
tor $ = (X'X^X'Y, we have fi0) = Xp = HY, 
where H = X(X'X)~ 1 X' . Assuming the columns 
of X are not colinear, the matrices H and M = I — H 
project onto orthogonal subspaces of dimensions p 
and n—p. The occurrence of p at the left in (8.4) is 
usually viewed as connected with the decomposition 
Y'Y = Y'HY + Y'MY and the fact that the projec- 
tion matrix H has rank p. For a projection matrix, 
however, rank and trace are identical, but it is the 
trace which actually matters. 

To appreciate this, note that if fa is any quantity 
determined independently of Y* , then 

(8.7) E(Y* - fa) 2 = E(Y* - ^) 2 + E(fa - m) 2 . 

On the other hand, 

E(Yi - fa) 2 = E{Yi - fa) 2 + E(fa - fn) 2 

(8.8) 

-2Cov(Fi,/y. 

Taken together, and remembering that E{Y* — fii) 2 = 
E(Yi — fii) 2 , these give 

(8.9) E(Y* - fa) 2 = E(Yi - fa) 2 +2Cov(Y i ,/l i ), 

and then summing over i shows that the difference 
between the right- and the left-hand sides of (8.2) is 

n 

2^Cov(y i> £ i ). 

i=l 

Equating this with 2pa 2 leads to the definition 

n 

(8.10) effective d.f. = -s VCovf^, ft). 

i=l 

The relations (8.7), (8.8) and (8.9) hold for any es- 
timator. But if p, = HY, that is, for a linear estima- 
tor, the covariances Cov(Yi, fa) are just the diagonal 
elements of H, so that 

(8.11) effective d.f. = trace(ff). 

a 1 

For nonlinear models, the (approximate) effective 
number of degrees of freedom may be defined either 
via (8.10), or via (8.11) if we use the trace of its 
locally linear approximation faf3) ~ faPo) + H(Y — 
fafio)), with both of these definitions being justifi- 
able asymptotically in view of (8.5) and (8.6), under 
the smoothness condition referred to there. 



Example: I x J ANOVA 

To help fix ideas, it is instructive to consider the 
optimization problem for the (complete) quadrati- 
cally penalized I x J ANOVA 12 

i=l j=l 

(8.12) 

We deliberately do not penalize for [i here because /i 
is typically known to differ substantially from zero. 
We will use the identity 

/ J 

^^Zi r i,j- H-ai- P 3 ) 2 

i=l j=l 

I J 

i=l 3=1 

(8.13) 

I 

+ IJ(fi - r.,) 2 + J[<*i ~ in, - r.,)} 2 
i=i 

+ J2l[P J -(r, J -r.,)} 2 , 

where the "dots" represent averaging. It differs from 
the standard ANOVA identity, but is derived simi- 
larly, although it requires Yll=i a « = an d S/=i Pj = 
0. Using (8.13), the optimization problem (8.12) sep- 
arates, leading to the solutions 

/} = r v , 

(8.14) c\i = (r, . - r...) and 

J + A\ 



Optimal choices for the regularization parameters Ai 
and A2 in (8.12) are usually estimated by cross- 
validation, however, here we wish to understand these 
analytically. We can do this by minimizing Akaike's 
predictive information criterion (AIC), 

AIC = -21og(£ A ) + 2df(A), 



12 Unlike the SVD case, discussed in (5.2) and in footnote 8 
of Section 5, using different values for Ai and A2 is essential 
here. 



18 



A. FEUERVERGER, Y. HE AND S. KHATRI 



where C\ is the value (under A-regularization) of 
the likelihood for the {nj} at the MLE, and df(A) 
is the effective number of degrees of freedom; here 
A = (Ai, A2). As we are in a Gaussian case, with an 
RMSE perspective, this is (except for additive con- 
stants) 



the same as Mallows' C p statistic, 



C, 



{residual sum of squares}; 



+ 2df(A). 



Minimizing this will (for linear models) be equiva- 
lent to minimizing the expected squared prediction 
error, defined as the rightmost term in (8.2), or (for 
nonlinear models) to minimizing it asymptotically. 
For further discussion of these points, see Chapter 7 
of Hastie et al. (2009). 

Now, the effective number of degrees of freedom 
associated with (8.12) can be determined by viewing 
the minimizing solution to (8.12) as a linear trans- 
formation, r = H\r, from the vector r consisting of 
the observations rij, to the vector f of fitted val- 
ues fjj. The entries of the matrix H\ are deter- 
mined from the relation rjj = jjL + &i + j3j, where p,, 
&i and j3j are given at (8.14). Thus, the effective 
number of degrees of freedom, when penalizing by 
(Ai,A2), is found to be 

df = trace H\ 

(8.15) 



= 1 + (J-1 

Next, for a given Ai 
squares is 



J 



+ GJ-1) 



i 



J + Ai 'I + A 2 

and A2, the residual sum of 



/ J 

EE 

i=i j=i 



J 



hi 



r. 



J+\i 



I+X 2 

and this may be expanded as 
I J 



{r it . - r v ) 



i=i 3=1 



+ J 1 



J 



J+X 



\ 2 1 
1// 1=1 



+/ 1 



.7=1 



where the first of the three terms here may subse- 
quently be ignored. 



Hence, the C p criterion we seek to minimize can 
be taken as 



1 1 J 

i=i j=i 



j 



r. . 



J + Ai 
I 

I + X2 



(r.j — r. ; .) 



+ 2\l + (I-l] 



J 



3 HJ-D • 



J + X 



I + Xo 



or, equivalently, 



1 



(T- 



J 1 



J + X 



\ 2 1 

1 ' i=l 



+/ 1 




+2 i (/ - 1) J+». 

The minimizations with respect to J/(J + Ai) and 
1/(1 + A2) thus separate, and setting derivatives to 
zero leads to the approximate solutions 

Ai = { =7 — ; —r—. )■ and 



.16) 



Ei=i(r.j-r v )W-l) 



On substituting these into (8.15), we also see that 
under the theoretically optimal regularization the ef- 
fective number of degrees of freedom for the AN OVA 
becomes 

(J + J-l) 

7-1 a 2 



+ 



J (V(/-i))E[=i( 
j-i 



a 



r i,- ~ r -+) 
2 



1 (l/(^-l))E/=iKi-r v ) 



the expression in braces gives the reduction in de- 
grees of freedom which results under the optimal 
penalization. Equations (8.16) and (8.14) may be 
interpreted as saying that optimal penalization (or 
shrinkage) should be done differentially by parame- 
ter groupings, with each group of (centered) param- 
eters shrunk in accordance with that group's vari- 
ability (the variances of the row and column effects 
here) relative to the variability of error, and each 
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parameter in accordance with its support base (i.e., 
with the information content of the data relevant to 
its estimation — here I and J). 

Empirical Bayes Viewpoint 

The preceding computations may be compared 
with an empirical Bayes approach. For this we will 
assume that r, 



'■j 



■ n + a>i + /3j + eij , with the e^j be- 
ing independent N(0,a 2 ) variables. For simplicity, 
we assume that fj, and a 2 are known. On the param- 
eters, ai and f3j, respectively, we posit independent 
N(0,a 2 ) and iV^of) priors, with a\ and a\ be- 
ing hyperparameters. Multiplying up the I + J + IJ 
normal densities for the ai, f3j and r^j, and again 
using (8.13), we can complete squares and integrate 
out the ai and f3j. This leads to a likelihood func- 
tion for a\ and a\ which, to within a factor not 
depending on a\ and a 2 , is given by 

1 V '-±-\(j ^_ 

2a 2 )\ J+{a 2 /°D, 



2ita 



exp 



2ira 



^Ia 2 + a 2 



exp 



i=l 

I 2 



2a 2 



I - 



I+(a 2 /a 2 ) 



■E( p -j- r -.-) s 

and maximizing this leads to the estimates 



*i 2 = 7l>v 



r. . 



i=l 



— and 



a 2 



1 J a 2 



The resulting empirical Bayes Gaussian prior can 
thus be seen as being essentially equivalent to the 
quadratically penalized optimization (8.12) under 
the optimal choice (8.16) for the penalty parame- 
ters Ai, A2. 

Generalizing 

We begin with a few remarks on the penalized 
sparse ANOVA 

^2 



$.17) 



1212^ -H-ai-Pjf 



+A 2 • 



This optimization can be carried out by EM or by 
gradient descent; it has no analytical solution, but 
analogy with the complete case suggests that the 
shrinkage rules 



a] 



- shrink 



Ji 



Ji + Xi 



-ai 



and 



shrink 



Ij + A 2 



Pi, 



where &i and (3j are the unpenalized estimates, will 
be approximately optimal provided we again take Ai 
and A2 as ratios of row and column variation relative 
to error as at (8.16). Koren (2010) proposed the less 
accurate but simpler penalization 



A) 



L + A 2 



first, and then 



Ey e j(i)Ki-A-A) 



Ji + Ai 

where p, is the overall mean; typical values he used 13 
were Ai = 10 and A 2 = 25. 

For more complex models, such as sparse SVD, 
the lessons here suggest that penalties on parameter 
groupings should correspond to priors which model 
the distributions of the groups. For Gaussian priors 
(quadratic regularization) we then need estimates 
for the group variances. For SVD we thus want esti- 
mates of the variances of each of the user and movie 
features. We experimented with fitting SVDs using 
minimal regularization — with features in descend- 
ing order of importance — first removing low usage 
users to better assess the true user variation. Be- 
cause free constants can move between correspond- 
ing user and movie features, we examined products 
of the variances of corresponding features. These do 
tend toward zero (theoretically, this sequence must 
be summable) but appear to do so in small batches, 
settling down and staying near some small value, be- 
fore settling still further, again staying a while, and 
so on. Our explanation for this is that there soon 
are no obvious features to be modeled, and that 
batches of features then contribute small, approx- 
imately equal amounts of explanatory power. Such 
considerations help suggest protocols for increasing 
regularization as we proceed along features. It is an 
important point that, in principle, the number of 



13 Koren's values were targeted to fit the probe set. If the 
probe and training sets had identical statistical properties, 
these values would likely have been smaller: recall that in Sec- 
tion 4 we obtained variances of 0.23 and 0.28 for the user and 
movie means, and RMSE values slightly below 1, suggesting 
the approximate values Ai ss A2 ~ 4. 
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features may be allowed to become infinite, as long 
as their priors tend toward degeneracy sufficiently 
quickly. 

Bell, Koren and Volinsky (2007a) proposed a par- 
ticularly interesting empirical Bayes regularization 
for the feature parameters in SVD. They modeled 
user parameters as u,, ~ N(fj,, Si), movie parameters 
as Vj ~ N(y, £2), an d individual SVD-based ratings 
as Vij ~ JS (u'iVjjCr 2 ), with the natural assumptions 
on independence. They fitted such models using an 
EM and a Gibbs sampling procedure, alternating be- 
tween fitting the SVD parameters and fitting the pa- 
rameters of the prior. See also Lim and Teh (2007). 

9. TEMPORAL CONSIDERATIONS 

This section addresses the temporal discordances 
between the Netflix training and qualifying data sets. 
See, for example, Figures 1 and 5 of Section 2 for ev- 
idence of such effects. Peoples' tastes — collectively 
and individually — change with time, and the movie 
"landscape" changes as well. The specific user who 
submits the ratings for an account may change, and 
day-of-week as well as seasonal effects occur as well. 
Furthermore, the introduction (and evolution) of 
a recommender system itself affects ratings. Here 
we provide a very brief overview of the main ideas 
which have been proposed for dealing with such is- 
sues, although to limit our scope, time effects are 
not emphasized in our subsequent discussions. Key 
references here are Koren (2008, 2009). 

We first note that temporal effects can be entered 
into models in a "global" way. Specifically, the stan- 
dard baseline ANOVA can be modified to read 

n,j =(i + ati(t) + (3j(t) + eij. 

Here all effects are shown as functions which depend 
on time, but the time arguments t can (variously) 
represent chronological time, or can represent a user- 
specific or a movie-specific time £, or tj, or even 
a jointly indexed time tij. 

Time effects can also be incorporated into both 
SVD and kNN type models. An example in the SVD 
case is the highly accurate model 

rij(t) = v + ai (t) + pj(t) 




referred to as "SVD++" by Koren (2009), and fit 
using both regularization and cross-validation. Here 
the baseline values oti(t) and (3j(t), as well the user 



effects Ui(t), are both allowed to vary over time 
but — on grounds that movies are more constant than 
users — the movie effects Vj are not. The last sum 
models feedback from the implicit information. De- 
tailed proposals for temporal modeling of the user 
and movie biases, and for the user SVD factors, 
Ui(t), as well as for modeling temporal effects in 
nearest neighbor models may be found in Koren 
(2008, 2009). 

10. IN SEARCH OF MODELS 

Examining and contrasting such models as ANOVA, 
SVD, RBM and kNN is useful in a search for new 
model classes. We first remark that the best fit- 
ting models — such as SVD and RBM — have high- 
dimensional, simultaneously fitted parameterizations. 
On the other hand, useful models need not have, 
with ANOVA and kNN both suggestive of this. If 
a model has p parameters, and if it is viewed as 
spanning a p-dimensional submanifold of R N , then 
we want p to not be too large, and yet for this sub- 
manifold to contain a vector close to the expected 
A-dimensional vector of data to be fitted. For this to 
happen, the model will have to reflect some substan- 
tive aspect of the structure from whence the data 
arose. One striking feature of collaborative filtering 
data is the apparent absence of any single model 
that can explain most of the explainable variation 
observed. The reason for this may be that the avail- 
able data are insufficient to reliably fit such a model. 
Were sufficient data available, it is tempting to think 
that some variation of SVD might be such a single 
model. In this section we indicate some extensions 
to the models already discussed. Most of these were 
arrived at independently, although many do con- 
tain features resembling those in models proposed 
by others. It is to be understood that regularization 
is intended to be used with most of the procedures 
discussed. 

Extending ANOVA 

Likert scales, such as the Netflix stars system, are 
subjective, with each user choosing for themselves 
what rating (or ratings distribution) corresponds to 
an average movie, and just how much better (or 
worse) it needs to be to change that rating into 
a higher (or a lower) one. This not only speaks to 
a centering for each user, captured by ctj terms, but 
also to a scaling specific to each user, and suggests 
a variation of the usual ANOVA of the form 

(10.1) r it j = fi + ati + 7»/9j+ error. 
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The scaling factors 7$ here are meant to be shrunk 
toward 1 in regularization. This model is of an in- 
teraction type, and may be generalized to 

(10.2) Tij = fj, + cti + fy + Interact + error, 

where the Interact term in (10.2) can vary among 14 

(a) notify, (b) Jjotify, 

(10.3) 

(c) 7i5jaj or (d) n^jfy- 

While these are all interaction models, note that 
(10.1) is equivalent to a one-feature SVD with only 
a user baseline. Likewise, note that (10.3) (a) 
and (10.3) (b) can be viewed as truncated nonlin- 
ear SVDs. As an experiment, we fitted (10.1) and 
obtained an RMSE of 0.90256 on the training set as 
compared with 0.9161 from the 2-way ANOVA fit of 
Section 4. In terms of MSE, this reduction is more 
than 6 times that expected under pure randomness. 

Finally, we remark that ANOVA ideas can also be 
adapted to model probability distributions of rat- 
ings. A typical model of this type, in obvious nota- 
tion, is 

P[ ri j = k) oc exp{//W + af ] + pf ] }. 

(k) 

If we wish, the dependence of a\ on k here could 
be suppressed. The numerical issues which arise here 
are similar to those of the SVD-based multinomial 
model described below. 

Extending SVD 

Likert scales are not intrinsically linear; the dis- 
tance between a 1 and 2 rating, for instance, is 
not equivalent to the distance between a 4 and 5. 
This suggests that the five possible rating values 
might first be transformed into five other numbers, 
g(r) = gi, 52,53,54 and g 5 , say. Since SVD is scale 
but not location invariant, such transformation of- 
fers 4 degrees of freedom. The nj can thus be trans- 
formed into new data, gij , say, and an SVD fitted to 
the gi j resulting in estimates gij = u^Vj. These fits 
may then be transformed back to the original scale 
by fitting a transformation fij = hiu^Vj). 

A further nonlinear extension to SVD is arrived at 
by arguing that people and movies are not compa- 
rable entities, so requiring their descriptors to have 



equal lengths is artificial. Furthermore, users are 
much more numerous than movies, so movie features 
are easier to estimate, while user features create the 
more severe overfitting. If we posit that each user 
is governed by p features U{ = «i,2, • • • ,v>i )P ), 
and each movie by q features Vj = (^,1,^,2, • • • , v j,q), 
with p < q, we may propose models such as 

v q 

n,j = n + a.i + fy + ^2 ^2 a k,e u i,k v j,£ 

k=l l=\ 



+ ^2 ^2 h,k',m,kUj,k'Vj, 



k=ik'=i i=\ 

p q q 



(10.4) + ^2 ^2 ^2 c k,i,e' u i,k v j,£ v j,£' 



= 1 



k=l 
p p 



= 1 



+ 2J XT dk^'/^'Ui^Ui^'Vj/Vj/i 

k=ik'=i t=\ l'=\ 
+ error. 

Such models can allow for additional flexibility, and 
modest gains from the lower-dimensional parame- 
terization combined with the reduced regularization 
required. 

SVD can also be adapted to model the multino- 
mial distributions of the rij, instead of just their 
expected values. A typical model of this type is 

expju'uj} 

(10.5) P[r itj = k}= Pt y; ; 

here each movie j is associated with five feature vec- 
tors v^, one for each rating value k. Note that, ex- 
cept for the absence of ratings-dependent movie bi- 
ases, (10.5) is similar to the defining equation (6.1) 
of the RBM model. Because movies are relatively 
few compared to users, the parameterization of such 
models is not much greater than for a standard SVD. 
Furthermore, the user terms U{ in (10.5) can be mod- 
eled as sums of movie parameters, as indicated fur- 
ther below. We remark that in one of our experi- 
ments, we tried to fit such models solely using means 
and RMSE criteria, as in 



EE 



h3 



Ef =1 fcexpK^ 



+ penalty, 



14 We do not mention nj — fi + on + f3j + ji/3j which is 
equivalent to (10.1), nor do we include 7i<5j or, equivalently, 
^iSjOnPj, in (10.3), as these are just single-feature SVDs with Deeper kNN 
baseline. However, we mention here the model n , 



but encountered difficulties with convergence. 



which is a sister to (10.1), but has no convincing rationale 
behind it. Note also that within the forms (10.3), the Oi could 
be changed to \ou\ or af, and similarly for the /3j. 



The kNN models of Section 7 can loosely be de- 
scribed as one layer deep; they involve like-minded 
users and/or similarly rated movies. However, this 
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does not exhaust the combinatorial possibilities. To 
focus on one simple case, suppose user i has seen 
movies, j and j', and that we wish to predict his 
or her rating for movie j". The remaining users can 
then be partitioned into eighteen sets, according to 
whether they did or did not see each of j , j' and j" , 
and if they had seen either of j or j' , according to 
whether their ratings did or did not agree with i. 
Such partitioning can carry information relevant to 
modeling i's rating for j" , but we do not pursue 
these issues here. 

Lessons of the RBM 

We start by recalling the defining equations (6.1) 
and (6.2) for the RBM model in the form 



P(r 



h3 



= k\hi) 

Etiexp(&" + £f =1 /v^ 



(10.6) 



and 

(io.7) p{h u =i\ n )=a\b t + Yl w ?;()- 

Examining these equations leads to valuable insights. 
First, the fact that (in this version of the model) the 
hidden user features are restricted to being binary 
seems inessential, except possibly in contributing to 
regularization. In any case, binary features are as- 
sociated with probabilities, so users are, in effect, 
being described by continuously-valued quantities. 
Second, it is not clear what essential data-fitting 
advantage is offered by viewing the user features 
as being stochastic; indeed, the probabilities associ- 
ated with them may themselves be regarded as non- 
stochastic descriptors. (It may be, however, that this 
randomness proxies an underlying empirical Bayes 
mechanism.) On the other hand, having a proba- 
bility model for the ri ~ seems natural — and per- 
haps even essential — for viewing the data in a fuller 
context. Next, aside from its contribution to parsi- 
mony, and to simplifying the fitting algorithms, the 
obligatory symmetry of the weights in (10.6) 

and (10.7) seems restrictive. Finally, we remark that 
the limitation on the bias terms bj in (10.6) to de- 
pend on movie but not on user also seems restric- 
tive. The RBM model does offer certain advantages; 
in particular, it is trainable. 

It pays to consider in further detail what it is that 
the RBM equations, (10.6) and (10.7), actually do. 
The second of these equations, in effect, models each 
user's features as a function of the movies he or she 



has seen, together with the ratings they had assigned 
to those movies. Doing so limits the dimension of the 
parameterization for the user features, a highly de- 
sirable goal. On the other hand, aside from the bj 
bias terms, and aside from the stochastic nature of 
the user features, the first equation models each of 
the multinomial probabilities for the j as a func- 
tion of an SVD-like inner product of the user's fea- 
ture vector with a movie features vector (associated 
with the rating value k) whose probability is being 
modeled. 

Such considerations lead us to propose a model 
which we arrived at by adapting the RBM equa- 
tions (10.6) and (10.7) for P[rij = k\hi] and for 
P[hi,i = l\ri] into analogous equations for expecta- 
tions, namely, 

(10.8) E(r itj \hi) = g (bj + V h^WjA 



e=i 



and 
(10.9) 



E(h it e\ri) = b e + 



E 

i'eJ(i) 



Here we have separated the different roles for the 
weights by using a tilde in (10.8); and because (10.8) 
now models expectations rather than probabilities, 
the dependence of the weights on k there has been 
removed. We next propose to use the right-hand side 
of (10.9) to estimate hij, and substitute it into (10.8); 
the bias terms then all combine, and we are led to 
the single equation model 



E(r i , j )=g[b j + J2\ E W ? 
V 1=1 l j'eJ(i) 

or, generalizing this slightly, 
(10.10) 



E{ri,j)=g \ bj + weight 



F 

E 

i=i 



E 

j'eJ(i) 



This model has SVD-like weights (features) Wj£ for 
the movies, and it models each user's weights (fea- 
tures) function of the movies they have seen, 
using weights associated with the movies, but de- 
pending also on the user's ratings for those movies. 
Although derived independently, we note that Pa- 
terek (2007) proposed a related model, except that 
in Paterek's model, the movie functions which deter- 
mine the user weights [corresponding to our Wij(k) 
here] do not depend on k, that is, on the user's rat- 
ings. Our model is therefore more general, but hav- 
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ing more parameters requires different penalization. 
See also the section on asymmetric factors in Bell, 
Koren and Volinsky (2007b). 

Modeling Users via Movie Parameters 

Parsimony of parameterization is a critical issue. 
Because users are 27 times more numerous than 
movies, assigning model parameters to users exacts 
a far greater price, in degrees of freedom, than as- 
signing parameters to movies. This suggests that pa- 
rameterization be arranged in such a way that its di- 
mension is a multiple of the number of movies rather 
than of the number of users; we thus try to model 
user features in terms of parameters associated with 
movies. In the context of the Neflix contest, Paterek 
(2007) was the first to have publicly suggested this. 

However, users cannot be regarded as being alike 
merely for having seen the identical set of movies; 
their ratings for those movies must also be taken 
into account. Such considerations lead to three im- 
mediate possibilities: 

(A) There is one feature vector, vj, associated 
with each movie. The feature, Ui, for the ith user 
is modeled as a sum (variously weighted) of func- 
tions of the Vj for the movies he or she has seen, 
and the ratings he or she assigned to them. 

(B) There are two feature vectors, Vj and Vj, as- 
sociated with each movie, and it, is based on the Vj 
for the movies i has seen, together with the ratings 
assigned to them. 

(C) There are six feature vectors, vj and Vj = 
Vj(k), for k = l,2,...,K (with K = 5) associated 
with each movie, and U{ is based on the Vj for the 
movies i has seen, and the ratings k assigned to 
them. 

(There are possibilities beyond just these three.) 
These considerations lead to models such as 

h,j = V + OLi + Pj + U^Vj, 

where the Vj are free p-dimensional movie parame- 
ters, while the Uj are defined in terms of other (p-di- 
mensional) movie parameters. For the approaches 
(A), (B) and (C) mentioned above, typical possibil- 
ities include 

j'GJ(i) 

Ui = 7 x r i,j'Vj' an d 

i'eJ(i) 

Ui = jx ^ Vj>(r id ), 
j'eJ(i) 



respectively, where the 7's are normalizing factors. 
In case (C), for example, the overall model would 
become 



+ ^ X Y Y V?A r i,j') V ht- 

e=ij>eJ(i) 

Note that (10.11) is essentially the same as the RBM- 
inspired model (10.10), but alternately arrived at. 
Here the weights 7 might take the form |J(i)|~ 5 , 
with typical possibilities for S being 0, 1/2 or 1, or 
as determined by cross-validation. 

11. FURTHER IDEAS AND METHODS 

Whether driven by fortune or by fame, the tenac- 
ity of the contestants in the Netflix challenge has not 
often been surpassed. In this section we collect to- 
gether a few ideas which have not yet been discussed 
elsewhere in this paper. A very few of these are our 
own (or at least were obtained independently), but 
the boundaries between those and the many other 
methods that have been proposed are necessarily 
blurred. 

We start by noting that covariates can be included 
with many procedures, as in 

nj = n + a.i{t)+ j3j(t) 

p M 

+ Y Ui ' iv i> £ + Y ° mX ij + • ■ • > 

i=l m=l 

where the Xf 1 - for m = 1, 2, . . . ,M are covariates. 
Such models can generally be fit using gradient de- 
scent, and since regularization is typically used as 
well, it would control automatically for collineari- 
ties among covariates. Covariates introduce very few 
additional parameters; hence, as a general rule (for 
this data), the more the better. A covariate will typ- 
ically differ across both users and movies, unless it 
is viewed as being explanatory to the "row" or "col- 
umn" effects. There are many possibilities for co- 
variates. (See, e.g., Toscher and Jahrer, 2008). A se- 
lection of these include the following: 

1 . User and movie supports Jj , Ij and various func- 
tions (singly and jointly) of these. 

2. Time between movie's release date and date rat- 
ing was made. 

3. The number and/or proportion of movies user i 
has rated before and/or after rating movie j; the 
number and/or proportion of users who have rated 
movie j before and/or after user i has rated it; 
and functions of these. 
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4. The relative density of movie ratings by the user 
around the time the rating was made; also, the 
movie's density of being rated around the time of 
rating. 

5. Seasonal and day-of-week-effects. 

6. Standard deviations and variances of the user's 
ratings and of the user's residuals. Same for movies. 

7. Measures of "surge of interest" to detect "run- 
away movies." 

8. The first few factors or principal components of 
the covariance matrix for the movie ratings. 

9. Relationship between how frequently rated ver- 
sus how highly rated the movie is, for example, 
(loglj - Avg) x (3 j. 

We next remark that although they are the easi- 
est to treat numerically, neither the imposed RMSE 
criterion, nor the widely used quadratic penalties, 
are sacrosanct for collaborative filtering. A mean ab- 
solute error criterion, for instance, penalizes large 
prediction errors less harshly or, equivalently, re- 
wards correct estimates more generously, and penal- 
ties based on L 1 regularization, as in the lasso (Tib- 
shirani, 1996), produce models with fewer nonzero 
parameters. Although the lasso is geared more to 
model identification and parsimony than to opti- 
mal prediction, the additional regularization it offers 
can be useful in models with large parameterization. 
Other departures from RMSE are also of interest. 
For example, if we focus on estimating probability 
distributions for the ratings, a question of interest 
is: With what probability can we predict a rating 
value exactly? An objective function could be based 
on trying to predict the highest proportion of rat- 
ings exactly correctly. A yet different approach can 
be based on viewing the problem as one of ranking, 
as in Cohen, Schapire and Singer (1999). See also 
Popescul et al. (2001). 

A natural question is whether or not it helps to 
shrink estimated ratings toward nearby integers. 
Takacs et al. (2007) considered this question from 
an RMSE viewpoint and argued that the answer is 
no. One can similarly ask whether it helps to shrink 
estimated ratings toward corresponding modes of es- 
timated probability distributions; we would expect 
there too the answer to be negative. 

Collaborative filtering contexts typically harbor 
substantial implicit information. In many contexts, 
a user's search history or even mouse-clicks can be 
useful. As mentioned several times previously, for 
Netflix, which movies a user rated carries informa- 
tion additional to the actual ratings. Here 99% of 



the data is "missing," but not "missing at random" 
(MAR). Marlin et al. (2007) discuss the impact of 
the MAR assumption for such data. Paterek (2007) 
introduced modified SVD models (called NSVD) of 
the type 

r hj =V + ai +Pj +v'j (ui + \Ji\~ 1/2 ^2 

V feJ(i) J 

where the yj are secondary movie features intended 
to model the implicit choices a user has made. The 
Netflix qualifying data set, for example, contains im- 
portant information by identifying many cases of 
movies that users had rated, even though those rat- 
ing values were not revealed. Corresponding to such 
information is an / x J matrix which can be thought 
of as consisting of 0's and l's, indicating which user- 
movie pairs had been rated regardless of whether 
or not the actual ratings are known. This matrix is 
full, not sparse, and contains invaluable information. 
All leading contestants reported that including such 
implicit information in their ensembles and proce- 
dures made a vital difference. Hu, Koren and Volin- 
sky (2008) treat the issue of implicit information in 
greater detail. See also Oard et al. (1998). 

Measures of similarity based on correlation-like 
quantities were discussed in Section 7. Alternate sim- 
ilarity measures can be constructed by defining dis- 
tances between the feature vectors of SVD fits. Im- 
plicit versions of similarity may also be useful. For 
example, the proportion of users who have seen mov- 
ie j is and who have seen movie j' is 
Under independence, the proportion who have seen 
both movies should be about \I(j)\\I(j')\/I 2 , but is 
actually Significant differences between 
these two proportions is indicative of movies that 
appeal to rather different audiences. 

Among the most general models which have been 
suggested, Koren (2008) proposed combining the 
SVD and kNN methodologies while allowing for im- 
plicit information within each component, leading 
to models such as 

h,j = ^ + a i + ^+v' j {u i + \J(i)\- 1 ' 2 Vj) 

+ \N k {j ] i)\- l l 2 Y, Ki'-bi.fWjj' 

+ \R k (j;i)\^ Y C J,f> 
j'£R k (j;i) 

where the bi.ji are a baseline fit. Here the sum involv- 
ing the yj makes the v'jUi SVD component "implicit 
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information aware. " The sets N k {j;i) and R k (j;i) 
represent neighborhoods based on the explicit and 
implicit information, respectively, while the last sum 
is the implicit neighborhood based kNN term. This 
model is among the best that have been devised for 
the Netflix problem. 

12. IN PURSUIT OF EPSILON: 
ENSEMBLE METHODS 

Meeting the 10% RMSE reduction requirement of 
the Netflix contest proved to be impossible using any 
single statistical procedure, or even by combining 
only a small number of procedures. BellKor's 2007 
Progress Prize submission, for instance, involved lin- 
ear combinations of 107 different prediction meth- 
ods. These were based on variations on themes, re- 
fitting with different tuning parameters, and differ- 
ent methods of regularization (quadratic, lasso and 
flexible normal priors). BellKor applied such varia- 
tions to both movie and user oriented versions of 
kNN, both multinomial and Gaussian versions of 
RBM, as well as to various versions of SVD. Resid- 
uals from global and other fits were used, as were 
covariates as well as time effects. The 2008 Progress 
Prize submission involved more of the same, blend- 
ing over 100 different fitting methods, and, in par- 
ticular, modeling time effects in deeper detail; the 
individual models were all fit using gradient descent 
algorithms. Finally, the Grand Prize winning sub- 
mission was based on a complex blending of no fewer 
than 800 models. 

Several considerations underpin the logic of com- 
bining models. First, different methods pick up sub- 
tly different aspects of the data so the nature of 
errors made by different models differ; combining 
therefore improves predictions. Second, prediction 
methods fare differently across various strata of the 
data, and user behavior across such strata differs 
as well. For instance, users who rated thousands of 
movies differ from those who only rated only a few. 
If regularized (e.g., ridge) regression is used to com- 
bine estimators, the data can be partitioned accord- 
ing (say) to support (i.e., based on the Jj and/or Ij), 
and separate regressions fit in each set, with the 
ridge parameters selected using cross-validation. 
A third consideration is related to the absence of 
any unique way to approach estimation and predic- 
tion in complex highly parameterized models. In the 
machine learning literature, ways of combining pre- 
dictions from many versions of many methods are 
referred to as ensemble methods and are known to 
be highly effective. (See, e.g., Chapter 16 of Hastie, 



Tibshirani and Friedman, 2009.) In fact, the Net- 
flix problem provides a striking and quintessential 
demonstration of this phenomenon. 

Various methods for blending (combining) models 
were used to significantly improve prediction per- 
formance in the Netflix contest and are described in 
Toscher and Jahrer (2008), Toscher, Jahrer and Bell 
(2009) and Toscher, Jahrer and Logenstein (2010). 
These include kernel ridge regression blending, kNN 
blending, bagged gradient boosted decision trees and 
neural net blending. Modeling the residuals from 
other models provided useful inputs, for example, 
applying kNN on RBM residuals. It was found that 
linear blending could be significantly outperformed 
and that neural net blending combined with bagging 
was among the more accurate of the proposed meth- 
ods. Essentially, individual models were fit on the 
training data while blending was done on the probe 
set, as it represented the distribution of user /movie 
ratings to be optimized over. In their winning sub- 
mission, Toscher, Jahrer and Bell (2009) noted that 
optimizing the RMSE of individual predictors is not 
optimal when only the RMSE of the ensemble counts; 
they implemented sequential fitting-while-blending 
procedures as well as ensembles-of-blends. Further 
details may be found in the cited papers. Some gen- 
eral discussion of ensemble methods is given in Hastie 
et al. (2009), Chapter 16. See also DeCoste (2006), 
and Toscher, Jahrer and Legenstein (2010). 

It should be noted that although the Netflix con- 
test required combining very large numbers of pre- 
diction methods, good collaborative filtering proce- 
dures do not. In fact, predictions of good quality can 
usually be obtained by combining a small number of 
judiciously chosen methods. 

13. NUMERICAL ISSUES 

The scale of the Netflix data demands attention to 
numerical issues and limits the range of algorithms 
that can be implemented; we make a few remarks 
concerning our computations. We used a PC with 
8 GB of RAM, driven by a 3 GH, four-core, "Intel 
Core 2 Extreme X9650" processor. Our computa- 
tions were mainly carried out using compiled C++ 
code called within a 64 bit version of MatLab run- 
ning on a 64 bit Windows machine. 

For speed of computation, storing all data in RAM 
was critical. Briefly, we did this by vectorizing the 
data in two ways: In the first, ratings were sorted 
by user and then by movie within user; and in the 
second, conversely. A separate vector carried the rat- 
ings dates. Two other vectors carried identifiers for 
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the users and for the movies; those vectors were 
shortened considerably by only keeping track of the 
indices at which a new user or a new movie began. 
Ratings were stored as "single" (4 bytes per data 
point, so 400 MB for all ratings) and user number as 
"Int32" (long integer, using 400 MB). As not all vari- 
ables are required by any particular algorithm, and 
dates often were not needed in our work, we could 
often compress all required data into less than 1 GB 
of RAM. Takacs et al. (2007) also discuss methods 
to avoid swapping data across ROM. 

Except for the RBM, we implemented many of 
the methods discussed in this paper, as well as many 
others proposed in the literature. In general, we found 
that gradient descent methods (stopping when RMSE 
on the probe set is minimized) worked effectively. 
For SVD, for example, one full pass through the 
training data using our setup took approximately 
3 seconds; thus, fitting a regularized SVD of rank 40 
(which required approximately 4000-6000 passes 
through) took approximately 4-6 hours. 

14. CONCLUDING REMARKS 

The Netflix challenge was unusual for the breadth 
of the statistical problems it raised and illustrated, 
and for how closely those problems lie at the fron- 
tiers of recent research. Few data sets are available, 
of such dimensions, that allow both theoretical ideas 
and applied methods to be developed and tested to 
quite this extent. This data set is also noteworthy 
for its potential to draw together such diverse re- 
search communities. It is to be hoped that similar 
contributions could be made by other such contests 
in the future. 

In this paper, we discussed many key ideas that 
have been proposed by a large number of individu- 
als and teams, and tried to contribute a few ideas 
and insights of our own. To provoke by trivializing, 
let us propose that there is one undercurrent which 
underlies most of what we have discussed. Thus, 
ANOVA/baseline values can all be produced by the 
features of an SVD. Likewise, fits from an SVD can 
be used to define kNN neighborhoods. And finally, 
the internal structure of an RBM is, in essence, anal- 
ogous to a kind of SVD. Hence, if there a single un- 
dercurrent, it surely is the SVD; that, plus covari- 
ates, plus a variety of other effects. What compli- 
cates this picture are the dimensions of the problem 
and of its parameterization, together with the ensu- 
ing requirements for regularization, and the difficul- 
ties (particularly the inaccuracies) of the estimation. 



For the Netflix problem, an interesting question 
to speculate on is: What is the absolutely best at- 
tainable RMSE? At one time, the 10% improve- 
ment barrier seemed insurmountable. But the al- 
gorithms of the winning and runner up teams ulti- 
mately tied to produce a 10.06% improvement (Test 
RMSE 0.8567) over the contest's baseline. When the 
prediction methods of these two top teams is com- 
bined using a 50/50 blend, the resulting improve- 
ment is 10.19% (Test RMSE 0.8555); see http:// 
www . the-ensemble . com. 

The Netflix challenge also raises new questions. 
Some of these have already been under active re- 
search in recent years, while others pose new ques- 
tions of problems that had been thought of as hav- 
ing been understood. For example, in the context 
of data sets of this size, how can one deal most ef- 
fectively with optimization under nonconvexity, as 
occurs, for instance, in very sparse SVD? Can bet- 
ter algorithms be devised for fitting RBM models, 
for having them converge to global optima, and for 
deciding on early stopping for regularization pur- 
poses? Furthermore, currently available theoretical 
results for determining optimal cross-validation pa- 
rameters are based on contexts in which the dis- 
tributions of the training data and of the cases for 
which predictions are required are the same. Can 
these theoretical results be effectively extended to 
cover cases in which the training and test sets are 
not identically distributed? The Netflix problem also 
highlights the value of further work to gain still 
deeper understanding of issues and methods sur- 
rounding penalization, shrinkage and regularization, 
general questions about bagging, boosting and en- 
semble methods, as well as of the trade-offs between 
model complexity and prediction accuracy. Related 
to this are questions about choosing effective pri- 
ors in empirical Bayes contexts (particularly if the 
number of parameters is potentially infinite), and 
of the consequences of choosing them suboptimally. 
What, for example, are the trade-offs between us- 
ing a regularized model having a very large num- 
ber of parameters, as compared to using a model 
having still more parameters but stronger regular- 
ization? For instance, if two SVD models are fit us- 
ing different numbers of features, but with penal- 
ization arranged so that the effective number of de- 
grees of freedom of both models is the same, can one 
deal theoretically with questions concerning which 
model is better? And finally, can general guidelines 
be developed, with respect to producing effective 
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ensembles of predictors, which apply to modeling 
of large data sets requiring extensive parameteriza- 
tion? Such questions are among the legacies of the 
challenge unleashed by the Netflix contest. 
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